
Memory optimized dists_add_symmetric #18

Conversation

@KushnirDmytro commented Apr 11, 2023

I am proposing a PR that fixes the old issues mentioning a 'CUDA out of memory' error when running the evaluation script.

I tracked the issue down to a single function: cosypose.lib3d.distances.dist_add_symmetric.

It allocates tensors of sizes NxNx3 and NxNx1, where N is the number of points.

Yet the same result can be achieved with far less memory by rewriting the code a little, as sketched below.
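A minimal sketch of the idea, under my own assumptions (the helper name and the returned quantity are illustrative, not the exact PR diff): torch.cdist produces the N×N distance matrix directly, so the N×N×3 difference tensor is never materialized.

```python
import torch

def dist_add_symmetric_lowmem(TXO_gt_points, TXO_pred_points):
    """Symmetric ADD-style distance, sketched with torch.cdist so that no
    B x Ngt x Npred x 3 difference tensor is ever allocated."""
    # B x Ngt x Npred matrix of Euclidean distances, computed exactly.
    distances = torch.cdist(TXO_gt_points, TXO_pred_points,
                            p=2, compute_mode='donot_use_mm_for_euclid_dist')
    # For each ground-truth point, the distance to its closest predicted point.
    min_dists, _ = distances.min(dim=2)
    # Average over points gives one score per batch element.
    return min_dists.mean(dim=1)
```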

Alternative solutions are also possible and were tested as working:

  • converting tensors to float16
  • computing the distances pointwise (too slow; on CPU it is even slower)
  • computing in batches of points (introduces another parameter, and is still slower; see the sketch after this list)

These approaches were tested but are not used in the current proposal. They could be added later if the issue arises again with larger point clouds or a requirement to run on constrained hardware.
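For reference, a sketch of the batched alternative mentioned above (the helper name and chunk size are my own; this is not part of the PR): at most chunk_size × Npred distances live in memory at any time.

```python
import torch

def min_dists_chunked(gt_points, pred_points, chunk_size=2048):
    """Closest-point distances computed chunk by chunk over the ground-truth
    points, trading a Python loop for a much smaller peak memory footprint."""
    mins = []
    for start in range(0, gt_points.shape[1], chunk_size):
        chunk = gt_points[:, start:start + chunk_size]   # B x c x 3
        d = torch.cdist(chunk, pred_points, p=2)         # B x c x Npred
        mins.append(d.min(dim=2).values)
    return torch.cat(mins, dim=1)                        # B x Ngt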

Also, some distance functions from the lib3d.symmetric_distances.py file could be optimized in the same way, as they compute similar distances.

This solution uses less than 0.25 of the original version's memory.
The experiment was performed on the run_cosy_pose_eval.py pipeline, evaluating 30 objects from the tless.bop version of the dataset.

[Image: MemExperiments — memory-usage comparison of the old and new implementations]

The run on the RTX 2080 (8 GB) was not clean, because the GPU was also used by the system GUI.
The TITAN X (12 GB) scenario was much cleaner, as it was performed on a headless server.
The old version of the code fails on both setups, while the new one works on both.
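A back-of-envelope estimate (my arithmetic, not a measurement from the table) of why the ratio lands around 0.25 for the largest TLESS cloud:

```python
# N x N x 3 float32 differences + N x N x 1 norms (old) vs a single
# N x N distance matrix (new), for the largest TLESS cloud.
N = 20801
gib = 1024 ** 3
old = (N * N * 3 * 4 + N * N * 1 * 4) / gib   # ~6.4 GiB
new = (N * N * 4) / gib                        # ~1.6 GiB
print(f"old ~{old:.1f} GiB, new ~{new:.1f} GiB, ratio ~{new / old:.2f}")  # ratio ~0.25
```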

The low out-of-memory threshold in these use cases can be explained by the memory already used for context data and by fragmentation: the error is triggered by the need to allocate one very large contiguous tensor. This PR fixes the allocation in question; the relevant lines of the diff:

    dists = dists[ids_row, assign, ids_col]
    return dists
    distances = torch.cdist(TXO_gt_points, TXO_pred_points,
                            p=2, compute_mode='donot_use_mm_for_euclid_dist')
@KushnirDmytro (Author) Apr 11, 2023

compute_mode='donot_use_mm_for_euclid_dist' is the important parameter value here.
Checked on a debug data instance: with the default mode, ~3% of points are assigned a different closest-point id.
The numeric differences are only about 1e-4 to 1e-6, and the performance benefit of the default mode is rather small, so the more precise mode is used (a small reproduction sketch follows).
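A small reproduction sketch (random points rather than the debug data instance, so the exact percentage is illustrative, not the reported ~3%):

```python
import torch

torch.manual_seed(0)
gt = torch.rand(1, 5000, 3)
pred = torch.rand(1, 5000, 3)

# The default mode may use a matmul-based expansion of the squared distance
# (faster, slightly less accurate); the explicit mode computes differences directly.
d_fast = torch.cdist(gt, pred, p=2)
d_exact = torch.cdist(gt, pred, p=2, compute_mode='donot_use_mm_for_euclid_dist')

changed = (d_fast.argmin(dim=2) != d_exact.argmin(dim=2)).float().mean()
max_diff = (d_fast - d_exact).abs().max()
print(f"changed closest-point ids: {changed:.2%}, max |distance diff|: {max_diff:.1e}")
```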

@nim65s (Member) commented Apr 12, 2023

Thanks @KushnirDmytro for this work!

Maybe @ElliotMaitre you could test this to double-check?

@ElliotMaitre

I tested it, and it works for me. This change allows running the evaluation on the tless dataset on an RTX-3060 (12 GB). However, on the ycbv dataset I still get the CUDA out_of_memory error. All in all, the improvement is still very noticeable!

Thank you for your contribution

@nim65s merged commit 1e45363 into Simple-Robotics:master May 4, 2023
@KushnirDmytro (Author)

@ElliotMaitre
Thank you for both the review and the appreciation!

After your comment, I was puzzled by the reported YCBV dataset problem.
I downloaded it (with the cosypose download script, the exact proposed version of the dataset), then successfully ran the evaluation (as proposed on the landing page of this repo). On an RTX 2080 with 8 GB it works fine, and memory consumption is modest.

Then I checked the data:

  • TLESS has 30 objects. Point cloud sizes vary from 4145 to 20801 points per object.
  • YCBV (bop) has 21 objects. Point clouds are standardized: only 2621 points each (one object has 2620, to be precise).

It feels like your CUDA out_of_memory issue had a different cause.
I have observed several times that when an eval (or another script) is terminated during ongoing computation, the process often hangs in the background and keeps occupying GPU memory. My hypothesis is that you launched the eval on YCBV while a zombie TLESS eval process was still running in the background. This is a reproducible scenario; I checked it.
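In case it helps to diagnose, a quick way to spot such leftover processes (assumes the nvidia-smi CLI is available on the machine):

```python
import subprocess

# List every process currently holding GPU memory; a zombie eval run
# shows up here with its PID even after the launching shell is gone.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv"],
    capture_output=True, text=True).stdout)
```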
